Query Logs Analytics: A Systematic Literature Review
In the digital era, user interactions with various resources such as databases, data warehouses, websites, and knowledge graphs (KGs) are increasingly mediated through digital platforms. These interactions leave behind digital traces, systematically captured in the form of logs. Logs, when effectively exploited, provide high value across industry and academia, supporting critical services (e.g., recovery and security), user-centric applications (e.g., recommender systems), and quality-of-service improvements (e.g., performance optimization). Despite their importance, research on log usage remains fragmented across domains, and no comprehensive study currently consolidates existing efforts. This paper presents a systematic survey of log usage, focusing on Database (DB), Data Warehouse (DW), Web, and KG logs. More than 300 publications were analyzed to address three central questions: (1) do different types of logs share common structural and functional characteristics? (2) are there standard pipelines for their usage? (3) which constraints and non-functional requirements (NFRs) guide their exploitation? The survey reveals a limited number of end-to-end approaches, the absence of standardization across log usage pipelines, and the existence of shared structural elements among different types of logs. By consolidating existing knowledge, identifying gaps, and highlighting opportunities, this survey provides researchers and practitioners with a comprehensive overview of log usage and sheds light on promising directions for future research, particularly regarding the exploitation and democratization of KG logs.
BEAVER: An Enterprise Benchmark for Text-to-SQL
Chen, Peter Baile, Wenz, Fabian, Zhang, Yi, Kayali, Moe, Tatbul, Nesime, Cafarella, Michael, Demiralp, Çağatay, Stonebraker, Michael
Existing text-to-SQL benchmarks have largely been constructed using publicly available tables from the web with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are utilized. As we will show, the reasons for poor performance are largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because they are largely in the "dark web", (2) schemas of enterprise tables are more complex than the schemas in public data, which makes the SQL-generation task inherently harder, and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. As a result, we propose a new dataset, BEAVER, sourced from real enterprise data warehouses together with natural language queries and their correct SQL statements, which we collected from actual user history. We evaluated this dataset using recent LLMs and demonstrated their poor performance on this task. We hope this dataset will help future researchers build more sophisticated text-to-SQL systems that can do better on this important class of data.
Optimal Decision Making Through Scenario Simulations Using Large Language Models
The rapid evolution of Large Language Models (LLMs) has markedly expanded their application across diverse domains, transforming how complex problems are approached and solved. Initially conceived to predict subsequent words in texts, these models have transcended their original design to comprehend and respond to the underlying contexts of queries. Today, LLMs routinely perform tasks that once seemed formidable, such as writing essays, poems, stories, and even developing software code. As their capabilities continue to grow, so too do the expectations of their performance in even more sophisticated domains. Despite these advancements, LLMs still encounter significant challenges, particularly in scenarios requiring intricate decision-making, such as planning trips or choosing among multiple viable options. These tasks often demand a nuanced understanding of various outcomes and the ability to predict the consequences of different choices, which are currently outside the typical operational scope of LLMs. This paper proposes an innovative approach to bridge this capability gap. By enabling LLMs to request multiple potential options and their respective parameters from users, our system introduces a dynamic framework that integrates an optimization function within the decision-making process. This function is designed to analyze the provided options, simulate potential outcomes, and determine the most advantageous solution based on a set of predefined criteria. By harnessing this methodology, LLMs can offer tailored, optimal solutions to complex, multi-variable problems, significantly enhancing their utility and effectiveness in real-world applications. This approach not only expands the functional envelope of LLMs but also paves the way for more autonomous and intelligent systems capable of supporting sophisticated decision-making tasks.
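The decision loop described above (collect candidate options and their parameters, simulate each outcome, score it against predefined criteria, pick the best) can be sketched minimally. All names here, including the trip-planning example, are illustrative assumptions rather than the paper's actual framework:

```python
def choose_best_option(options, simulate_outcome, utility):
    """Return the option whose simulated outcome maximizes the utility score.

    `options` are the candidates (with parameters) gathered from the user,
    `simulate_outcome` predicts a consequence for each option, and `utility`
    scores that consequence against predefined criteria.
    """
    return max(options, key=lambda opt: utility(simulate_outcome(opt)))

# Hypothetical trip-planning example: minimize cost plus a penalty per hour.
trips = [
    {"name": "train",  "cost": 120, "hours": 5},
    {"name": "flight", "cost": 200, "hours": 2},
    {"name": "bus",    "cost": 60,  "hours": 9},
]
best = choose_best_option(
    trips,
    simulate_outcome=lambda t: (t["cost"], t["hours"]),
    utility=lambda outcome: -(outcome[0] + 20 * outcome[1]),
)
```

With the weights shown, the train (120 + 20*5 = 220) beats both the flight and the bus (240 each); changing the per-hour penalty changes the winner, which is the point of running the optimization rather than asking the LLM to guess.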
Towards augmented data quality management: Automation of Data Quality Rule Definition in Data Warehouses
Tamm, Heidi Carolina, Nikiforova, Anastasija
In the contemporary data-driven landscape, ensuring data quality (DQ) is crucial for deriving actionable insights from vast data repositories. The objective of this study is to explore the potential for automating data quality management within data warehouses, a type of data repository commonly used by large organizations. By conducting a systematic review of existing DQ tools available in the market and academic literature, the study assesses their capability to automatically detect and enforce data quality rules. The review encompassed 151 tools from various sources, revealing that most current tools focus on data cleansing and fixing in domain-specific databases rather than data warehouses. Only a limited number of tools, specifically ten, demonstrated the capability to detect DQ rules, let alone implement this in data warehouses. The findings underscore a significant gap in the market and academic research regarding AI-augmented DQ rule detection in data warehouses. This paper advocates for further development in this area to enhance the efficiency of DQ management processes, reduce human workload, and lower costs. The study highlights the necessity of advanced tools for automated DQ rule detection, paving the way for improved practices in data quality management tailored to data warehouse environments. The study can also guide organizations in selecting the data quality tool that best meets their requirements.
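As a rough illustration of what automated DQ rule detection means in practice, the sketch below profiles a sample of column values and proposes simple candidate rules (not-null, uniqueness, numeric range). This is a deliberately minimal assumption of how such a tool might start, not a reconstruction of any tool covered by the review:

```python
def infer_dq_rules(column, values):
    """Propose simple data quality rules from sampled column values.

    A minimal sketch of automated DQ rule detection; real tools profile
    whole warehouse tables and emit far richer constraints.
    """
    rules = []
    non_null = [v for v in values if v is not None]
    if len(non_null) == len(values):
        rules.append(f"{column} IS NOT NULL")            # no missing values observed
    if len(set(non_null)) == len(non_null):
        rules.append(f"{column} IS UNIQUE")              # no duplicates observed
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        lo, hi = min(non_null), max(non_null)
        rules.append(f"{column} BETWEEN {lo} AND {hi}")  # observed numeric range
    return rules
```

For example, `infer_dq_rules("customer_id", [101, 102, 103])` proposes all three rules, while a column with nulls and duplicate strings yields none. Rules inferred this way are only candidates; a human (or a stronger model) still has to confirm which ones reflect genuine business constraints.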
Translating Natural Language Queries to SQL Using the T5 Model
Wong, Albert, Pham, Lien, Lee, Young, Chan, Shek, Sadaya, Razel, Khmelevsky, Youry, Clement, Mathias, Cheng, Florence Wing Yau, Mahony, Joe, Ferri, Michael
This paper presents the development process of a natural language to SQL model using the T5 model as the basis. The models, developed in August 2022 for an online transaction processing system and a data warehouse, have a 73% and 84% exact match accuracy respectively. These models, in conjunction with other work completed in the research project, were implemented for several companies and used successfully on a daily basis. The approach used in the model development could be implemented in a similar fashion for other database environments and with a more powerful pre-trained language model.
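The exact-match metric reported in the abstract can be computed as follows; the whitespace and case normalization shown here is an assumption, since papers differ on what counts as an "exact" match between generated and gold SQL:

```python
def exact_match_accuracy(predicted_sql, gold_sql):
    """Fraction of predictions that equal the gold SQL after trivial
    whitespace and case normalization (one common reading of 'exact match')."""
    def norm(sql):
        return " ".join(sql.lower().split())
    pairs = list(zip(predicted_sql, gold_sql))
    return sum(norm(p) == norm(g) for p, g in pairs) / len(pairs)
```

Stricter variants compare raw strings; looser ones parse both queries and compare canonicalized ASTs or execution results, which forgives semantically equivalent but textually different SQL.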
Towards Avoiding the Data Mess: Industry Insights from Data Mesh Implementations
Bode, Jan, Kühl, Niklas, Kreuzberger, Dominik, Hirschl, Sebastian, Holtmann, Carsten
With the increasing importance of data and artificial intelligence, organizations strive to become more data-driven. However, current data architectures are not necessarily designed to keep up with the scale and scope of data and analytics use cases. In fact, existing architectures often fail to deliver the promised value associated with them. Data mesh is a socio-technical, decentralized, distributed concept for enterprise data management. As the concept of data mesh is still novel, it lacks empirical insights from the field. Specifically, an understanding of the motivational factors for introducing data mesh, the associated challenges, implementation strategies, its business impact, and potential archetypes is missing. To address this gap, we conduct 15 semi-structured interviews with industry experts. Our results show, among other insights, that organizations have difficulties with the transition toward federated governance associated with the data mesh concept, the shift of responsibility for the development, provision, and maintenance of data products, and the comprehension of the overall concept. In our work, we derive multiple implementation strategies and suggest that organizations introduce a cross-domain steering unit, observe data product usage, create quick wins in the early phases, and favor small dedicated teams that prioritize data products. While we acknowledge that organizations need to apply implementation strategies according to their individual needs, we also deduce two archetypes that provide suggestions in more detail. Our findings synthesize insights from industry experts and provide researchers and professionals with preliminary guidelines for the successful adoption of data mesh.
Top 19 Skills You Need to Know in 2023 to Be a Data Scientist - KDnuggets
If you want to be a data scientist in 2023, there are several new skills you should add to your roster, as well as the slew of existing skills you should have already mastered. Part of the problem is job scope creep. Nobody knows what a data scientist is, or what one should do, least of all your future employer. So anything that has data gets stuck in the data science category for you to deal with. You're expected to know how to clean, transform, statistically analyze, visualize, communicate, and predict data.
Trends That Will Impact Data Analytics, AI, And Cloud In 2023 - Liwaiwai
As we enter 2023, the world of analytics, AI, and cloud is entering an exciting new phase, with a wide range of innovations and developments set to reshape the landscape. Below are some trends that will have the most impact in the coming year. In 2023, as global economic uncertainty continues, enterprises with data-intensive workloads in the cloud will need to review their cloud strategies with a greater focus on cost optimization. Cloud spending will be more closely scrutinized based on the ROI and TCO of existing projects or new investments. One area where cost optimization is particularly important in the coming year is data transfer egress costs, which can make up a significant portion of an organization's cloud bill.
Your Data Architecture Holds the Key to Unlocking AI's Full Potential
In the words of J.R.R. Tolkien, "shortcuts make long delays." I get it, we live in an age of instant gratification, with Doordash and Grubhub meals on-demand, fast-paced social media and same-day Amazon Prime deliveries. But I've learned that in some cases, shortcuts are just not possible. Such is the case with comprehensive AI implementations; you cannot shortcut success. Operationalizing AI at scale mandates that your full suite of data (structured, unstructured, and semi-structured) gets organized and architected in a way that makes it usable, readily accessible, and secure.
ETL vs ELT: Which One is Right for Your Data Pipeline? - KDnuggets
ETL and ELT are data integration pipelines that transfer data from multiple sources to a single centralized source and apply transformation and processing steps along the way. The difference between the two is that ETL transforms the data before loading, while ELT transforms the data after loading. But before diving deeply into them, let's first understand the meaning of E, T, and L. E for Extract - Extracting the data is the process of pulling it from one or more source systems. T for Transform - Transforming the data is the process of cleaning and modifying it into a format that can be used for business analysis. L for Loading - It involves loading data into a target system, which may be a data warehouse or a database. ETL is the first standardized data integration method, which emerged in the 1970s with the evolution of disk storage.
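The contrast can be sketched with Python's built-in sqlite3 standing in for both the pipeline and the warehouse (the table names and the toy cleaning step are illustrative): ETL cleans rows in the pipeline before loading, while ELT loads the raw rows first and performs the same cleaning inside the warehouse with SQL.

```python
import sqlite3

# Toy raw records extracted (E) from a source system.
raw = [(1, "19.99", " us "), (2, "5.00", "EU")]

con = sqlite3.connect(":memory:")  # stand-in for the warehouse

# --- ETL: transform (T) in the pipeline, then load (L) clean rows. ---
con.execute("CREATE TABLE sales_etl (id INT, amount REAL, region TEXT)")
clean = [(i, float(a), r.strip().upper()) for i, a, r in raw]
con.executemany("INSERT INTO sales_etl VALUES (?, ?, ?)", clean)

# --- ELT: load (L) raw rows first, then transform (T) inside the warehouse. ---
con.execute("CREATE TABLE sales_raw (id INT, amount TEXT, region TEXT)")
con.executemany("INSERT INTO sales_raw VALUES (?, ?, ?)", raw)
con.execute("""
    CREATE TABLE sales_elt AS
    SELECT id, CAST(amount AS REAL) AS amount, UPPER(TRIM(region)) AS region
    FROM sales_raw
""")
```

Both target tables end up identical; the practical difference is where the transformation compute runs (the pipeline host for ETL, the warehouse engine for ELT) and that ELT keeps the raw data around for later re-processing.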